5. Main Analysis
Scatter plots
The first Scatter Plot shows the relationship between the number of delivered messages and the number of opened messages. The second scatter plot shows the relationship between the number or delivered messages and the number of opt outs. Each circle in the plot represents a send job. I also added a smooth line to help determine overall trends.
In both cases I see a positive, somehow strong linear relationship between the two variables (Sends vs Opens and Sends vs Opt outs), I can corroborate some of the findings described in the Executive Summary chapter. I see many Red and Green above the smooth line for Sends vs. Opt outs. I see that Red tends to have relatively small batches of sent messages (less than 170,000 users per batch) but I see two outliers with more than 500,000 users in batch that caused about 600 users each to click an opt out link, way above the average opt out rate.
The plots are very dense in the lower third of the graph and it is hard to see what is going on when send jobs are small to medium in size.
(*)Removed outlier message_id=210650080 for better visibility
message_action <- mail_summary_with_details %>% mutate(business_area=reorder(business_area,total_messages, sum))%>% filter(message_id!=210650080)%>% filter(action!="clicked")%>% group_by(message_id, action, business_area) %>% dplyr::summarise(total_messages= sum(total_messages)) %>% spread(action, total_messages, fill = 0) %>% gather(key=action, value=total_messages, -message_id, -delivered, -business_area) %>% mutate(action = factor(action, levels=c("opened" ,"clicked" , "opt_out")))
facet_label<-c(delivered="Delivers (in 10,000 messages)",opened="Opens", "opt_out"="Opt outs")
message_action %>% mutate(delivered=delivered/10000)%>% ggplot(aes(delivered,total_messages)) + geom_point(aes(col = business_area), shape=21, alpha=0.8) + geom_smooth(color = "black", size = 0.5, se=FALSE) + facet_wrap(~action,ncol = 1, scales='free_y', labeller = as_labeller(facet_label),switch='y') + scale_color_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_blank(), legend.direction = "horizontal", legend.text = element_text(margin = margin(2, unit = "pt")),strip.placement = "outside", axis.title.y=element_blank(), axis.title.x=element_text(size=10)) + guides(colour = guide_legend(reverse = TRUE,nrow = 1, override.aes = list(shape = 19, alpha=1))) + labs(title="Sends (in 10,000 messages) vs. Opens and Opt outs",x = "Sends (in 10,000 messages)")
Parallel coordinates chart
I added an interactive parallel coordinate chart to be able to analyze different variables at the same time. I changed the coloring variable, reordered columns, sliced the data, worked with a sample set and full set, filtered the data to remove outliers, all of this to try to find relationships or patterns in the data. For this report I decided to use quantile= ntile(opened_r,4) as the coloring variable. The ntile function allowed me to split the data is 4 groups of identical size based on the value of opened_r (open rate) variable. I also took a sample to 500 rows to improve performance.
It was hard to find dependencies and correlations in the data. However, one thing stood out, the open rate for the second group showed a narrow range around the 20% mark.
Sampling the data had a negative effect. The first group is mostly imperceptible since most of the sampled data in that group has zero open rate.
Here is the plot. Feel free to explore on your own.
message_specialty_rates_sample<-mail_summary_with_details %>% mutate(business_area=reorder(business_area,total_messages, sum),specialty=reorder(specialty,total_messages, sum))%>% group_by(message_id, business_area,action, specialty) %>% dplyr::summarise(messages=sum(total_messages)) %>% spread(action,messages, fill=0) %>% ungroup() %>% mutate(opens_r=opened*100/delivered, opt_out_r=opt_out*10000/delivered, click_r=clicked*10000/delivered) %>% ungroup() %>% select(-message_id) %>% filter(click_r<5000, opens_r<100, opt_out_r<200) %>% sample_n(500)%>% mutate(quantile=ntile(opens_r, 4))
parcoords(message_specialty_rates_sample,
rownames = F # turn off rownames from the data.frame
, brushMode = "1D-axes"
, reorderable = T
, queue = T
, color = list(
colorBy = "quantile"
,colorScale = htmlwidgets::JS("d3.scale.category10()")
)
)
Violin plots with jitters
I was curious to see the distribution of some variables by business area. I though violin plots will be a good visualization tool for this purpose.
Each circle in the jitters represents a send job. We can estimate the number of send jobs by the density of the jitters.
The first set of violin plots shows open rates by business area. You can see business areas with low open rate variance (Example: Orange and Red) and high open rate variance (Pink and Blue). The plots also show some bi-modal distribution which will be interesting to explore further.
The second set of violin plots shows opt out rate by business area. The plots show more variation of the mean of each distribution, indicated by the red dot. Some business areas have very low opt out rate mean such as Red, some have very high, such as Blue - outside the plot area. In both plots I used the coord_cartesian function in ggplot to remove outliers
message_summary_rates<-mail_summary_with_details %>% mutate(business_area=reorder(business_area,total_messages, sum))%>% group_by(message_id, business_area,action) %>% dplyr::summarise(messages=sum(total_messages)) %>% spread(action,messages, fill=0) %>% ungroup() %>% mutate(opens_r=opened*100/delivered, opt_out_r=opt_out*100/delivered, click_r=clicked*100/delivered)
message_summary_rates %>% ggplot(aes(x=factor(business_area, levels = rev(levels(business_area))), y= opens_r))+
geom_point(position="jitter", aes(col=business_area), shape=21,alpha = 0.2,show.legend = FALSE) + geom_violin(trim=FALSE, alpha=0) + stat_summary(fun.data=mean_sdl,
geom="pointrange", color="red") + scale_color_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) + labs(title="Violin Plots showing density distribution of Open Rate by Business Area",x="", y = "Open Rate (per 100 Sends)") +
theme_classic() +theme(axis.text.x = element_text(angle=30,hjust=1)) + coord_cartesian(ylim=c(0,50))
message_summary_rates %>% ggplot(aes(x=factor(business_area, levels = rev(levels(business_area))), y= opt_out_r))+
geom_point(position="jitter", aes(col=business_area), shape=21,alpha = 0.2,show.legend = FALSE) +
geom_violin(alpha=0, adjust = 32) + stat_summary(fun.data=mean_sdl,
geom="pointrange", color="red")+ scale_color_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) + labs(title="Violin Plots showing density distribution of Opt Out Rate by Business Area",x="", y = "Opt outs Rate (per 10,000 Sends)") +
theme_minimal() +theme(axis.text.x = element_text(angle=30,hjust=1)) + coord_cartesian(ylim=c(0,0.8))
Histograms
In an effort to find what causes bimodalily in the open rate distribution, I drew the open rate frequency distribution with the help of the histogram function in ggplot. I used animation to explore how different business areas contributed to the overall histogram. The graphs show that the business areas responsible for the higher number of messages have a well defined bell curve for their open rate frequency distribution with low variance. On the other hand, the business areas that generate the least amount of messages, are less predictable with high variance.
It is worth noting that the business areas with higher mean correspond to business areas with subscription based email policies (Orange, Red, Pink) and the ones with lower mean correspond to business areas that send messages to any user who has not opted out(Light Blue and Green).
I had difficulties with the library gganimate, the flag cumulative = TRUE was not working. To compensate for that, I chose to add the full distribution in the background with low opacity.
histogram_data<-mail_summary_with_details %>% inner_join((mail_summary_with_details %>% filter(action=='delivered' & total_messages >100) %>% group_by(message_id) %>% dplyr::summarise(delivered=sum(total_messages))), by ="message_id") %>% mutate(business_area=reorder(business_area,total_messages, sum)) %>% filter(action=='opened') %>%group_by(message_id,business_area, delivered) %>% dplyr::summarise(opened=sum(total_messages) )%>% mutate(opens_r=round(opened*100/delivered,2))
p2<-ggplot(histogram_data,aes(x=opens_r, cumulative=T)) + geom_histogram(aes(fill=business_area),bins=1000,alpha=0.2 )+ geom_histogram(aes(fill=business_area,frame=business_area, cumulative=TRUE),bins=1000 ) + theme_minimal() + scale_fill_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) +coord_cartesian(xlim = c(0,50)) + guides(fill=FALSE) + labs(title="Histogram showing Open Rate Frequency Distribution\n\n", x="Open Rate", y="Open Rate Frequency")
gganimate(p2)
ggplot(histogram_data,aes(x=opens_r, cumulative=T)) + geom_histogram(aes(fill=business_area),bins=1000,alpha=0.2 )+ geom_histogram(aes(fill=business_area),bins=1000 ) +facet_wrap(~business_area, nrow=2) + theme_minimal() + scale_fill_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) +coord_cartesian(xlim = c(0,50)) + guides(fill=FALSE) + labs(title="Histogram showing Open Rate distribution\n\n", x="Open Rate", y="Open Rate Frequency")
Heatmaps
I had explored how different metrics were affected by business area and by specialty. With a heatmap I am hoping to get some insight on how both variables affect opt out rates. The plots are dense with information therefore I used an interactive heatmap be able to zoom in and see details. The color of the tile indicates the opt out rate value - dark blue represents the lowest and yellow, the highest rate. Opt outs weight more if the number of delivered emails is small. I chose to enhance the information provided in he tooltip when hovering over a tile. This allows the person interacting with the plot to judge if the opt out rate has merit or can be ignored.
I found that it was more informative to scale the values by columns and rows separately. It helps highlight difference in variable performance much better
The plots are read differently. In the first case the opt out rate is scaled for each specialty individually, therefore you focus in one column (specialty) and see how users in this group interact with messages from different business areas. In contrast, in the second plot the rows are scaled individually and the plot is read by row. Columns and rows were sorted by the average of out out rate. This helps the reader to see areas with high performers and areas with low performers.
Let me give an example on how to read first plot using the Chicken specialty. Users with this specialty perform the worst with messages from business area Blue and the best for business area Orange.
It is worth noting that most specialties perform best when interacting with messages from business area Orange and worse with business area Green, a fact that can be seen in the Executive Summary with a different type of graph.
A similar analysis can be done for Open Rate and Click Rate to evaluate how business area and specialty contribute to this variables. I are not including these extended analysis in this report.
specialty_business_area_opt_out_scaled<-mail_summary_with_details %>% dplyr::filter(!is.na(specialty))%>% group_by(specialty, business_area ,action) %>% dplyr::summarise(messages= sum(total_messages))%>%spread(action,messages,fill = 0) %>% mutate(opt_out_r=opt_out*100/delivered) %>% mutate(delivered=delivered/1000) %>% ungroup() %>% mutate(specialty=reorder(specialty,desc(opt_out_r),mean),business_area=reorder(business_area,desc(opt_out_r),mean)) %>% mutate(Sends = paste(delivered*1000, "\nOpt outs:", opt_out, "\nOpt out rate:", round(opt_out_r,2))) %>% arrange(specialty, opt_out_r) %>% group_by(specialty) %>% mutate(by_specialty=rescale(opt_out_r)) %>%ungroup() %>% group_by(business_area) %>% mutate(by_business_area=rescale(opt_out_r))
q<-specialty_business_area_opt_out_scaled %>% ggplot(aes(y=business_area,x= specialty, fill = by_specialty, label=Sends)) + geom_tile(show.legend = TRUE) + theme_minimal() + theme(axis.text.x = element_text(size = 6, angle = 30, hjust=1), axis.text.y = element_text(size = 6,angle = 30, hjust=1), axis.title=element_blank(), plot.margin = unit(c(t = 0, r = 50, b = 80, l = 5), "pt")) + scale_fill_viridis() +labs(fill="", title= "Heatmap with opt out rate scaled by column (by specialty)")
ggplotly(q, tooltip = c("y", "x", "label"))
q<-specialty_business_area_opt_out_scaled %>% ggplot(aes(y=business_area,x= specialty, fill = by_business_area, label=Sends)) + geom_tile(show.legend = TRUE) + theme_minimal() + theme(axis.text.x = element_text(size = 6, angle = 30, hjust=1), axis.text.y = element_text(size = 6,angle = 30, hjust=1), axis.title=element_blank(), plot.margin = unit(c(t = 0, r = 50, b = 80, l = 5), "pt")) + scale_fill_viridis() +labs(fill="", title= "Heatmap with opt out rate scaled by row (by business area)")
ggplotly(q, tooltip = c("y", "x", "label"))